Bloom filter with memory element

ABSTRACT

Techniques are provided for determining if an element is contained in a set of elements. In one aspect, an element may be received and inserted into a bloom filter. The element may also be inserted into a memory associative on the bloom filter indexes. In another aspect, a search element may be received and compared to a bloom filter. If the search element is included in the bloom filter, a memory may be used to determine if the search element is included in the set of elements.

BACKGROUND

There are many situations in which determining if a specified sequence of characters is included in a larger pool of characters is needed. For example, a network security appliance may examine all packets traversing a network in order to detect malicious packets. For example, a packet may contain a portion of a virus that is identifiable by a certain sequence of characters in the packet. By examining every packet for the sequence of characters, it may be determined if the packet may contain at least part of the virus. Once it is determined that a particular character string is present, further action may be taken. For example, in the case of a network security appliance, the packet may be forwarded to additional logic to determine if the packet actually contains the virus, or if the character sequence just happened to be included in the packet for non-malicious reasons.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an example of a system utilizing a memory to eliminate false positives when using a bloom filter.

FIG. 2 is an example of a high level flow diagram for inserting an element into the set of elements that are being searched for.

FIG. 3 is another example of a high level flow diagram for inserting an element into the set of elements that are being searched for.

FIG. 4 is an example of a high level flow diagram for determining if an element is in a set of elements.

FIG. 5 is another example of a high level flow diagram for determining if an element is in a set of elements.

DETAILED DESCRIPTION

A need often arises to search a large set of characters for the presence of certain sequences of characters, referred to as character strings. For example, a network security device may be used to examine every packet flowing through the network to identify malicious packets and packet streams by examining each packet for the presence of certain character strings. The presence of a “signature” character string may indicate that the packet is part of a flow that includes a computer virus. In order to achieve this result, the network security device must compare the contents of every packet to a list of possible signatures to determine if a signature is contained in the packet. A similar situation may occur anytime a large amount of data needs to be scanned to detect the presence of specified character strings.

Such a comparison can be very resource intensive. Further exacerbating the problem is the fact that the vast majority of packets will not match a signature. Thus, any processing used to analyze packets which do not match a signature is essentially wasted. It would be beneficial to have a mechanism that can quickly determine if a packet contains a signature of interest. One mechanism that has been used is a bloom filter.

A bloom filter is a data structure that can be used to quickly determine if an element is included in a set. For example, a network security device may have a plurality of known virus signatures. This plurality of signatures can be considered a set of elements. Bloom filters may be used to compare an arbitrary input to the set of elements. If the bloom filters return a negative result, it can be guaranteed that the input is not included in the set of elements. However, if the bloom filters return a positive result, this only indicates that the input has the possibility of being included in the set of elements. In other words, the bloom filters may produce false positive results.

Techniques provided herein overcome the problem of false positives in bloom filters. An element is received and processed through a plurality of bloom filters. If the bloom filters indicate a negative result, it can be guaranteed that the element is not included in the set of elements. However, if the bloom filters return a positive result, a second step of processing occurs. In the second step, the particular element used in the initial application of the bloom filters is examined to determine if that element was actually inserted into the set of elements. The process is described in further detail below and in conjunction with the appended figures.

FIG. 1 is an example of a system utilizing a memory to eliminate false positives when using a bloom filter. The system 100 may include a device 110. For example, the device 110 may be included in a network security appliance. However, the techniques described herein are applicable anywhere that bloom filters may be used to determine inclusion in a set. The device may typically be implemented in hardware, such as in an Application Specific Integrated Circuit (ASIC). Although shown as a standalone device, it should be understood that the capabilities described herein may be included within hardware that also provides other functionality.

The device 110 may include processing logic 112. The processing logic may be implemented using discrete logic gates, processors, field programmable gate arrays, or any other type of logic. The processing logic may include receive logic 114, bloom filter index logic 116, insert logic 118, and compare logic 120. The various logic blocks may be used to implement the techniques described herein. The operation of these logic blocks is described in further detail below.

The device 110 may also include memory 122 storing various data structures. The data structures can include bloom filters 124 and a set element structure 126. The bloom filters may be used to determine if an element is potentially included in a set, while the set element structure may be used to eliminate false positives that result from the bloom filters. The data structures 124 and 126 are also depicted in expanded form as elements 124-a and 126-a respectively.

In order to aid in the description of the techniques presented herein, the following operational example is provided. Assume at the beginning of the example that the set of elements, and as such the bloom filters, are empty. Also assume that the set elements structure is empty. In other words, in the beginning, the device is not searching for any elements. In order to add an element to the bloom filters, the element is first received by the receive logic 114. The element to be inserted may be referred to as the insertion element. The insertion element may be processed by a number of hash functions 116-a. In the present example, there are four hash functions shown; however, it should be understood that this is for purposes of explanation and not limitation. These hash functions may be implemented in the bloom filter index logic 116. As shown, input 130, which consists of the characters ABC, is sent to the hash functions 116-a. The result of this operation is that each hash function produces a value for the input. Each of these values may be referred to as a bloom filter index. As shown, input 130 (ABC) has produced bloom filter indexes 2, 10, 6, and 15. It should be understood that this is an ordered set of indexes. For purposes of this description, groupings of four values within parentheses indicate the results of the bloom filter index logic.
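The index computation described above can be sketched as follows. This is a minimal illustration only, not the hash functions of the described device: the use of salted BLAKE2b digests, the function name `bloom_filter_indexes`, and the filter size of 16 entries are all assumptions made for the sketch, and the indexes it produces will not match the values (2, 10, 6, 15) used in the example.

```python
import hashlib

NUM_HASHES = 4     # four hash functions, as in the example
FILTER_SIZE = 16   # assumed number of entries in each bloom filter


def bloom_filter_indexes(element: str) -> tuple:
    """Return the ordered bloom filter indexes, one per hash function."""
    data = element.encode("utf-8")
    indexes = []
    for hash_number in range(NUM_HASHES):
        # Assumption: a different salt stands in for each distinct hash function.
        digest = hashlib.blake2b(
            data, digest_size=8, salt=bytes([hash_number])
        ).digest()
        indexes.append(int.from_bytes(digest, "big") % FILTER_SIZE)
    return tuple(indexes)


# The position within the tuple identifies the hash function, so order matters.
print(bloom_filter_indexes("ABC"))
```

Because the result is a tuple, the ordering of the indexes is preserved, which is the property the set element structure relies on later.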

The insert logic 118 may be used to insert the element. In order to insert the element into the bloom filters, the bloom filter indexes are examined. For each hash function, the entry identified by the index is set to true by the insert logic in the bloom filter associated with the particular hash. For example, the input ABC produced bloom filter indexes 2, 10, 6, and 15 for hash functions 0, 1, 2, and 3 respectively. Thus, looking at bloom filters 124-a, it can be seen that indexes 2, 10, 6, and 15 have been set to true for their respective bloom filters.
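The insertion step can be sketched as below, assuming one bloom filter per hash function and 16 entries per filter (the size is an assumption; the text only shows indexes up to 15). The table `EXAMPLE_INDEXES` is a hypothetical stand-in for the bloom filter index logic and simply records the ordered indexes that the description lists for each inserted element.

```python
NUM_HASHES = 4
FILTER_SIZE = 16  # assumed size of each bloom filter

# Ordered indexes taken from the example in the description.
EXAMPLE_INDEXES = {
    "ABC": (2, 10, 6, 15),
    "XYZ": (7, 4, 9, 2),
    "DEF": (4, 14, 1, 4),
    "LMO": (13, 0, 12, 8),
    "QRS": (13, 0, 12, 8),
}


def insert_into_filters(filters, indexes):
    """Set the entry named by each index in the filter for that hash."""
    for hash_number, index in enumerate(indexes):
        filters[hash_number][index] = True


# One list of booleans per hash function, all initially false (empty set).
filters = [[False] * FILTER_SIZE for _ in range(NUM_HASHES)]
for indexes in EXAMPLE_INDEXES.values():
    insert_into_filters(filters, indexes)

for hash_number, bloom in enumerate(filters):
    set_entries = [i for i, bit in enumerate(bloom) if bit]
    print(f"hash {hash_number}: {set_entries}")
```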

In addition to setting the index values in the bloom filters, the set elements structure is also set to reflect the insertion of the element ABC. In one example implementation, the set element structure is implemented as a data structure in memory, with at least a portion of the memory being a content addressable memory (CAM). The ordered bloom filter indexes may be inserted into the content addressable portion of the memory. The particular element that is being inserted may also be added to the set elements structure and is associated with the ordered indexes that were inserted into the CAM portion of the memory. Finally, a rule may be associated with the set element. The rule may specify what action is to be taken upon a match of the particular set element. For example, the rule may indicate that further processing on a matching element is needed. The particular actions of a rule are relatively unimportant and are dependent on the application utilizing the techniques described herein.

In the present example, it can be seen that element ABC has been added to the set element structure as entry 150. In order to insert the element, it is first determined if the indices hit on an existing CAM entry. If not, an empty entry is found. A non-empty entry is described in further detail below. The particular indexes (2, 10, 6, 15) are then placed in the bloom filter indices CAM. The actual element is also stored and is associated with the CAM entry. Finally, the rule for the set element ABC is associated with the set element. This same process may occur for any elements that are to be added to the bloom filter. For example, element XYZ (7, 4, 9, 2) 151 causes the corresponding indices in the bloom filters 124-a to be set to true. In addition, the set element structure is modified such that the indices are placed in the CAM, the element is associated with the CAM entry and the rule associated with the element is also associated with the element.

The same process can occur for elements DEF (4, 14, 1, 4) 152 and LMO (13, 0, 12, 8) 153. The next element, QRS, when processed by the hash functions, results in indexes 13, 0, 12, and 8. It should be noted that these index values happen to be the same as the indexes produced for element LMO. When adding the element QRS to the set element structure, there is no need to find an empty CAM entry, as an entry already exists. The new element, QRS 154, may simply be associated with the CAM entry 153 that was previously added for element LMO. In such cases, both elements are associated with the same CAM entry.
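The population of the set element structure can be sketched in software as below. A Python dictionary keyed by the ordered index tuple stands in for the CAM portion of the memory; this is an illustrative assumption, not the hardware structure itself. The function name `insert_set_element` and the rule strings are hypothetical placeholders.

```python
def insert_set_element(cam, indexes, element, rule):
    """Associate the element and its rule with the entry for its ordered indexes.

    If the indexes already hit an existing entry, the element is added to it;
    otherwise a new (previously empty) entry is used.
    """
    entry = cam.setdefault(tuple(indexes), [])
    entry.append((element, rule))


cam = {}  # ordered indexes -> list of (element, rule) pairs
insert_set_element(cam, (2, 10, 6, 15), "ABC", "rule-ABC")
insert_set_element(cam, (7, 4, 9, 2), "XYZ", "rule-XYZ")
insert_set_element(cam, (4, 14, 1, 4), "DEF", "rule-DEF")
insert_set_element(cam, (13, 0, 12, 8), "LMO", "rule-LMO")
insert_set_element(cam, (13, 0, 12, 8), "QRS", "rule-QRS")  # reuses LMO's entry

for indexes, associated in cam.items():
    print(indexes, [element for element, _ in associated])
```

Five elements are inserted but only four entries are used, because LMO and QRS share the same ordered indexes.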

Now that the bloom filter and set element structure have been populated, the use of the system to quickly check for the presence of an element in the set of elements while eliminating the possibility of false positives may now be explained. For purposes of this explanation, several examples are presented to describe various use cases. In the first example, assume that it is desired to know if element PQR is included in the set of elements inserted into the bloom filter. In other words, is string PQR included in the set of strings that are being searched for? Assume that the bloom filter indexes generated by the bloom filter index logic 116 for the string PQR are (6, 9, 12, 15). The compare logic 120 may be used to determine if the element PQR has been inserted into the bloom filters. As should be clear, determining that PQR is not in the set of elements is as simple as examining the first bloom filter. Because index 6 in the bloom filter associated with hash 0 is not set to true, there is no possibility of the element being included in the set (otherwise the index would have been set to true).

The second example presents a more interesting case. Assume that it is desired to determine if element QWE is included in the set of elements. Assume element QWE has bloom filter indices (2, 0, 9, 15). Here, each of the indexes in the bloom filters corresponding to the element QWE has been set, however, not by the element QWE. For example, in the bloom filter associated with hash 0, index 2 was set by element ABC. The bloom filter associated with hash 1 had index 0 set by both elements LMO and QRS. The bloom filter associated with hash 2 had index 9 set by element XYZ. The bloom filter associated with hash 3 had index 15 set by element ABC. Thus, even though element QWE is not included in the set of elements, a mechanism using only the bloom filters would have indicated that element QWE was in the set of elements. In other words, element QWE would have resulted in a false positive.

The techniques described herein overcome this problem of false positives through the set element structure. Once an element has passed through the bloom filters and has been determined as possibly being included in the set of elements, the set element structure may be examined by compare logic 120 to determine if the element was actually inserted or if it is a false positive. Here, the ordered index for element QWE is (2, 0, 9, 15). Because that particular ordering of indexes has not been entered into the CAM, there will be no CAM hit when the set element structure is accessed. As such, this means that the particular combination of indexes produced by element QWE was never inserted into the set of elements, and is not included in the set of elements. Thus, the false positive produced by the bloom filter has been overcome.

In addition to overcoming false positives as has been described above, the techniques described herein also prevent false positives in the case where two elements just so happen to have bloom filter indexes that are the same, but only one of the elements was inserted into the set of elements. For example, assume that element PDQ has bloom filter indexes (2, 10, 6, 15) which just so happens to be exactly the same as element ABC. Element ABC is included in the set. The bloom filter analysis will indicate element PDQ as potentially being included in the set of elements. The CAM search will result in a hit on ordered indexes (2, 10, 6, 15) because that set of ordered indexes was inserted by element ABC. An additional comparison is done by the compare logic between the element and the elements associated with the CAM entry. In this example, CAM entry 150 is only associated with element ABC, and as such element PDQ is not included in the set of elements. The false positive is once again eliminated.

The techniques described herein are also useful when two elements have the same ordered bloom filter indexes and are both included in the set of elements. For example, element QRS has indexes (13, 0, 12, 8). The bloom filters would indicate this element as possibly being within the set of elements. A CAM search would result in a hit 153 on those indexes. The element QRS may then be compared to all elements associated with this CAM entry. Here, both elements QRS and LMO are associated with CAM entry 153. Thus, because the element QRS matches, it can be determined that the element is within the set of elements.
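The four search cases discussed above (PQR, QWE, PDQ, and QRS) can be traced with the following sketch. It is an illustration under stated assumptions rather than the described hardware: a dictionary keyed by the ordered indexes stands in for the CAM, each bloom filter is assumed to have 16 entries, and the index tuples are copied from the example instead of being computed by hash functions. The function name `lookup` and the result strings are hypothetical.

```python
FILTER_SIZE = 16  # assumed size of each bloom filter

# Elements inserted in the example, with their ordered indexes.
INSERTED = {
    "ABC": (2, 10, 6, 15),
    "XYZ": (7, 4, 9, 2),
    "DEF": (4, 14, 1, 4),
    "LMO": (13, 0, 12, 8),
    "QRS": (13, 0, 12, 8),
}

# Search elements from the example, with the indexes the text assumes for them.
QUERIES = {
    "PQR": (6, 9, 12, 15),
    "QWE": (2, 0, 9, 15),
    "PDQ": (2, 10, 6, 15),
    "QRS": (13, 0, 12, 8),
}

filters = [[False] * FILTER_SIZE for _ in range(4)]
cam = {}  # ordered indexes -> set of associated elements
for element, indexes in INSERTED.items():
    for hash_number, index in enumerate(indexes):
        filters[hash_number][index] = True
    cam.setdefault(indexes, set()).add(element)


def lookup(element, indexes):
    """Two-stage check: bloom filters first, then the set element structure."""
    # Stage 1: any unset entry proves the element was never inserted.
    if not all(filters[h][i] for h, i in enumerate(indexes)):
        return "not in set (bloom filters)"
    # Stage 2a: the exact ordered indexes must have been inserted.
    associated = cam.get(tuple(indexes))
    if associated is None:
        return "false positive eliminated (no CAM hit)"
    # Stage 2b: the element itself must be associated with the entry.
    if element not in associated:
        return "false positive eliminated (element mismatch)"
    return "in set"


for element, indexes in QUERIES.items():
    print(element, "->", lookup(element, indexes))
```

Only QRS is reported as being in the set; PQR is rejected by the bloom filters alone, while QWE and PDQ pass the bloom filters and are rejected in the second stage.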

Although the above description was based on using a CAM that stores the bloom filter indexes of an element, an alternate example implementation may directly store the element in the CAM. Thus, once the bloom filters have determined that an element may potentially be included in the set of elements, the CAM structure may be searched using the element itself. In such an implementation, the step of locating the CAM entry and then comparing the element associated with the CAM entry to the search element can be avoided.
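A minimal sketch of this alternate arrangement follows, again using a dictionary as a software stand-in for the CAM. Here the element itself is the key, so a single lookup confirms membership and retrieves the rule. The name `confirm` and the rule strings are hypothetical.

```python
# Element-keyed stand-in for a CAM that stores the element directly.
element_cam = {
    "ABC": "rule-ABC",
    "XYZ": "rule-XYZ",
    "DEF": "rule-DEF",
    "LMO": "rule-LMO",
    "QRS": "rule-QRS",
}


def confirm(element):
    """Return the rule for an element that passed the bloom filters, or None."""
    return element_cam.get(element)


print(confirm("QRS"))  # element was inserted, so its rule is returned
print(confirm("PDQ"))  # never inserted, so the false positive is rejected
```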

In yet another example implementation (not shown), the CAM portion of the memory may be eliminated completely. The ordered indexes may be hashed to obtain a location in memory. The elements (and the corresponding rules) associated with the hash of the ordered indexes may be stored starting at the location in memory. Once the bloom filters have determined that an element may be included in the set of elements, the ordered bloom filter indexes may be hashed to determine the location in memory. All elements associated with the location in memory may then be compared to the search element to determine if the search element is included in the set of elements.
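The CAM-free variant can be sketched as follows. The choice of CRC-32 as the hash of the ordered indexes, the table of eight memory locations, and the names `location_for`, `store`, and `contains` are all assumptions made for illustration; the description does not specify them.

```python
import zlib

NUM_LOCATIONS = 8  # assumed number of memory locations


def location_for(indexes):
    """Hash the ordered indexes to a memory location (CRC-32 is an assumption)."""
    return zlib.crc32(repr(tuple(indexes)).encode("ascii")) % NUM_LOCATIONS


# Each location holds the (element, rule) pairs stored starting there.
memory = [[] for _ in range(NUM_LOCATIONS)]


def store(indexes, element, rule):
    memory[location_for(indexes)].append((element, rule))


def contains(indexes, element):
    """Compare the search element to every element stored at the location."""
    return any(stored == element for stored, _ in memory[location_for(indexes)])


store((2, 10, 6, 15), "ABC", "rule-ABC")
store((13, 0, 12, 8), "LMO", "rule-LMO")
store((13, 0, 12, 8), "QRS", "rule-QRS")

print(contains((13, 0, 12, 8), "QRS"))  # True: QRS was stored
print(contains((2, 10, 6, 15), "PDQ"))  # False: same indexes as ABC, different element
```

Different ordered indexes may hash to the same location in this sketch; the result stays correct because the final decision rests on comparing the elements themselves.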

This two stage approach eliminates the possibility of false positives without requiring excessive amounts of processing. The bloom filters may take care of the majority of cases independently. If an element is definitively not in the set, this can be determined by the bloom filters alone. It is only in the cases where the element has a possibility of being in the set of elements that the second stage of processing occurs. Thus, it is likely that the majority of cases do not even reach the second stage, which reduces the amount of processing needed for those cases. In addition, because the set elements structure is efficiently organized, the elimination of false positives, when needed, is also efficient.

The above description was simplified to relatively small bloom filters with a small number of hash functions for purposes of ease of description only. The techniques described herein are applicable regardless of the size of the bloom filters or of the number of hash functions. In fact, an actual implementation may include indexes in the range of tens of thousands with a significantly larger number of hash functions. The techniques described herein are not limited by the selected size of the bloom filter or number of hash functions.

Furthermore, the above description was presented in terms of elements that were a fixed number of characters, and the number of characters was the same for all elements. This was for purposes of explanation only. The techniques described herein are applicable regardless of the length of the element that is being searched for or if the elements have different lengths. What should be understood is that the results of the initial hash are compared to the bloom filter to determine if the element is included in the set of elements. The creation of the hash results is not dependent on the length of the input.

In addition, the techniques above are applicable in any situation wherein a bloom filter may be employed. Although the example of a network security device was mentioned, it should be understood that the techniques are applicable to any use of bloom filters.

FIG. 2 is an example of a high level flow diagram for inserting an element into the set of elements that are being searched for. In block 210 a set element to be included in a set of elements may be received. For example, this element may be an element that is to be inserted into a bloom filter in accordance with the techniques described herein. At a later point in time, the bloom filters may be queried to determine if this element has been added to the set. In block 220 a plurality of bloom filter indexes may be computed based on the set element. In other words, the index values for each of the bloom filters may be determined. In some implementations, the computation may be done through a hashing function. Regardless of implementation, a plurality of bloom filter indexes may be determined.

In block 230, the set element may be inserted into the bloom filter using the plurality of bloom filter indexes. In other words, the computed bloom filter indexes are used to determine which entries in the bloom filters will have their bits set to a true value and which ones will remain null. In block 240, the set element may be stored in a memory, wherein the element is accessed in the memory to eliminate false positives. As explained above, when the bloom filters determine that an element may possibly be included in the set of elements, accessing the memory may be used to eliminate a false positive.

FIG. 3 is another example of a high level flow diagram for inserting an element into the set of elements that are being searched for. In block 310, just as above, an element to be included in a set of elements may be received. This set element is to be inserted into the bloom filters. In block 320, a plurality of bloom filter indexes may be computed, based on the element. In block 330, the element may be inserted into the bloom filters using the plurality of bloom filter indexes.

In block 340 it may be determined if the implementation is using a CAM. If so, the process moves to block 345. In block 345, an empty entry in the memory may be located, wherein at least a portion of the memory is a CAM. In block 350 it may be determined if the implementation is storing indexes or the element itself in the CAM. If the bloom filter indexes are being stored in the CAM, the process moves to block 355. In block 355, the ordered set of bloom filter indexes is stored in the CAM of the located entry. In block 360, the element is stored, associated with the located entry. If it is determined in block 350 that indexes are not being stored, the process moves to block 365. In block 365, the element is stored in the CAM of the located entry.

If it is determined in block 340 that the implementation is not using a CAM, the process moves to block 370. In block 370, a hash of the ordered set of indexes is computed. In block 375, memory associated with the hash is identified. In block 380, the element is added to the identified memory.

FIG. 4 is an example of a high level flow diagram for determining if an element is in a set of elements. In block 410, a search element may be received. The search element is the element that may be checked against the bloom filters to determine if the search element may exist in the set of elements. In block 420, bloom filter indexes may be computed for the search element. As explained above, in one implementation, the bloom filter indexes may be computed using a series of hash functions. However, it should be understood that any other method of computing bloom filter indexes would also be suitable.

In block 430, the computed bloom filter indexes may be compared to a plurality of bloom filters to determine if the search element is not included in the set of elements. As explained above, if any of the bloom filter entries identified by the computed indexes does not contain a true value, then the search element can definitively be determined to not be included in the set of elements. However, if all of the identified entries do contain a true value, it can be determined that the search element has the possibility of being included in the set of elements.

In block 440, when the search element is not indicated as not being included in the set of elements, the search element may be compared to a memory to determine if the search element is included in the set of elements. In other words, if the bloom filters indicate a possibility of the element being included in the set of elements, the memory may be accessed to determine if the element is actually included in the set of elements or if it is a false positive.

FIG. 5 is another example of a high level flow diagram for determining if an element is in a set of elements. In block 510, just as above, a search element may be received. In block 520, again as above, bloom filter indexes for the search element may be computed. In block 530, the computed bloom filter indexes may be compared to bloom filters to determine if the search element is possibly included in the set of elements.

In block 535, it may be determined if there was a match in block 530. If there was no match, the search element is not included in the set of elements. As such, the process moves to block 540, in which it is determined that the search element is not in the set of elements because the computed bloom filter indexes are not in the bloom filter. If the determination in block 535 is that the computed bloom filter indexes are included in the bloom filter, the process moves to block 545.

In block 545 it is determined if this implementation utilizes a CAM. If so, the process moves to block 550. In block 550 it is determined if the indexes are stored in the CAM. If so, the process moves to block 555. In block 555, it may be determined, using a CAM associative on the ordered set of indexes, that an entry exists for the ordered set of indexes. In block 560, it may be determined that the search element is associated with the entry. If the search element is associated with the entry, this means that the search element is included in the set of elements.

If it is determined in block 550 that the indexes are not stored, the process moves to block 565. In block 565 it may be determined, using a CAM associative on the search element, that an entry exists for the search element. If an entry exists, then this means that the search element is included in the set of elements.

If it is determined in block 545 that the implementation does not use a CAM, the process moves to block 570. In block 570, a hash of the ordered set of indexes may be computed. In block 575 an entry from the memory may be retrieved based on the hash. In block 580, it may be determined if the search element is associated with the retrieved entry. If so, this indicates that the search element is included in the set of elements.

1. A method comprising: receiving an element to be included in a set of elements; computing a plurality of bloom filter indexes based on the element; inserting the element into a plurality of bloom filters using the plurality of bloom filter indexes; and storing the element in a memory, wherein the element is accessed in the memory to eliminate false positives.
2. The method of claim 1 wherein the plurality of bloom filter indexes is an ordered set of indexes.
3. The method of claim 2 wherein storing the element in the memory further comprises: locating an empty entry in the memory, wherein at least a portion of the entry is a content addressable memory; storing the ordered set of bloom filter indexes in the content addressable memory portion of the located entry; and storing the element associated with the located entry.
4. The method of claim 2 wherein storing the element in the memory further comprises: locating an empty entry in the memory, wherein at least a portion of the entry is a content addressable memory; and storing the element in the content addressable memory portion of the located entry.
5. The method of claim 2 wherein storing the element in the memory further comprises: computing a hash of the ordered set of indexes; identifying the memory associated with the hash; and adding the element to the identified memory.
6.-15. (canceled)